fix(neo4j): self-healing cross-session edge writes + unified :Node identity (v4.0.2) #21
…entity (v4.0.2)

The chunked-flush edge writer built every cross-session edge with two MATCH clauses, silently dropping edges when endpoints were not yet committed.

Fix: MERGE both endpoints to create :Node placeholders, re-key node identity on the universal :Node label so placeholders converge with later typed writes, add a :Node(node_id, workspace) uniqueness constraint as the atomicity guard for concurrent MERGEs, make the previously-dead universal-:Node backfill reachable, and drop the legacy plain index.

The :Node label was introduced in PR #19 but its backfill shipped unreachable, leaving pre-#19 nodes untagged. The migration at startup (dedup → backfill → verify → drop legacy index → :Node constraint) safely brings the production graph into the state B′ assumes, before any write. Full operator runbook in docs/node-identity-migration.md.

Bumps server to v4.0.2.

Evidence:
- tests/neo4j: 69 passed (includes RED→GREEN silent-drop proof + dirty-graph migration test)
- Non-neo4j unit suite: 1373 passed, 2 skipped
- ruff: clean on all changed files

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
Tested #21 against a clone of a real 1.4M-node graph before merge — green light from our side. 👍 We snapshotted a live context-intelligence graph (1,398,163 nodes / 1,937,529 rels) into a throwaway Neo4j and ran the migration there. Production was never touched. Tests (independent run):
So: ~10s end-to-end at real scale, no OOM, no errors. The dead-code backfill now runs and completes, and the drop-legacy-index → create-

One non-blocking suggestion: the Step-1 global dedup (

Net: safe to merge from a real-data-scale standpoint. The self-healing edge writer + backfill resurrection both check out. Nice work.
Thanks for running it against a real 1.4M-node clone, @bkrabach — that's the validation that matters here. On the Step-1 dedup batching suggestion: I tried it two ways and both made things worse on a memory-constrained Neo4j, so I'm keeping the un-batched form. Evidence:
If we ever need hard bounding for a pathologically duplicated graph, the right shape is a streaming per-key delete that avoids any full-graph grouping, but that wants the
Summary

Cross-session graph edges were silently dropped when an endpoint node was not yet committed at flush time. This PR makes the chunked-flush edge writer self-healing and unifies node identity on the universal `:Node` label so the fix holds across the whole ingest pipeline. Bumps the server to v4.0.2.

Root cause
`_edge_merge_cypher` built every cross-session edge with two `MATCH` clauses.

Those two `MATCH` clauses act as an inner join: if either endpoint was not yet committed, the row produced no bindings, the `MERGE` never ran, and the edge was dropped with no error, no log, and no warning. This affects every cross-layer/cross-session edge the pipeline writes (`SOURCED_FROM` ×27, plus `HAS_PART`, `TRIGGERED`, `HAS_STEP`, and ~20 other edge types), all of which legitimately reference endpoints that may not exist at the writing flush.

Fix
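The inner-join behaviour described under Root cause, and the MERGE-based behaviour this PR switches to, can be modelled with a small in-memory sketch. This is not the project's code and does not talk to Neo4j: `ToyGraph`, `merge_node`, `write_edge_match`, and `write_edge_merge` are hypothetical names invented for illustration, and the dictionary keyed on `(node_id, workspace)` only stands in for identity on the universal `:Node` label.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

Key = Tuple[str, str]  # (node_id, workspace): the identity key named in the PR


@dataclass
class ToyGraph:
    """Hypothetical in-memory stand-in for the graph (not the Neo4j driver)."""

    nodes: Dict[Key, Set[str]] = field(default_factory=dict)  # key -> labels
    edges: Set[Tuple[Key, str, Key]] = field(default_factory=set)

    def merge_node(self, key: Key, label: Optional[str] = None) -> None:
        # Models MERGE (n:Node {...}) optionally followed by SET n:<label>:
        # create the node if absent, then add the typed label to the same node.
        labels = self.nodes.setdefault(key, {"Node"})
        if label is not None:
            labels.add(label)

    def write_edge_match(self, src: Key, rel: str, dst: Key) -> bool:
        # Models MATCH, MATCH, MERGE: a missing endpoint yields no row,
        # so the relationship MERGE never runs and nothing is reported.
        if src not in self.nodes or dst not in self.nodes:
            return False
        self.edges.add((src, rel, dst))
        return True

    def write_edge_merge(self, src: Key, rel: str, dst: Key) -> bool:
        # Models MERGE, MERGE, MERGE: an absent endpoint becomes a placeholder.
        self.merge_node(src)
        self.merge_node(dst)
        self.edges.add((src, rel, dst))
        return True


g = ToyGraph()
a, b = ("s1", "ws"), ("s2", "ws")
g.merge_node(a, "Session")                    # a is committed, b is not yet

dropped = not g.write_edge_match(a, "SOURCED_FROM", b)  # old writer: edge lost
healed = g.write_edge_merge(a, "SOURCED_FROM", b)       # new writer: placeholder
g.merge_node(b, "Session")                    # later typed write, same node
```

In this model the later typed write lands on the placeholder instead of creating a second node, which is the convergence the re-keyed identity is meant to give.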
- `_edge_merge_cypher`: `MATCH` → `MERGE` both endpoints, then `MERGE` the relationship. An absent endpoint becomes a `:Node` placeholder rather than a dropped edge.
- `MERGE (n:Session {…})` → `MERGE (n:Node {…}) SET n:Session`. Identity keys on the universal `:Node` label so a placeholder converges with the later typed write instead of forking into two nodes; the `:Session` label is still applied.
- `ensure_neo4j_schema`: add a `:Node(node_id, workspace)` uniqueness constraint as the atomicity guard for concurrent MERGEs (the role the `:Session` constraint played for the Session MERGE); deduplicate `(node_id, workspace)` globally; run the universal-`:Node` backfill (previously unreachable behind an early `return`); drop the legacy plain `idx_node_universal` index (a uniqueness constraint carries its own backing index and cannot coexist with a standalone index on the same key); log backfill completion.
- … `:Node`-keyed identity.

Migration
The universal `:Node` label was introduced in #19, but its backfill shipped unreachable, so the graph holds nodes written before #19 that lack the label. `ensure_neo4j_schema` migrates idempotently (global dedup → backfill → verify → drop legacy index → `:Node` constraint) before any write. On the large production graph, run it as a verified two-phase deploy. Full runbook, verification queries, and rollback: `docs/node-identity-migration.md`.

Evidence
- `tests/neo4j` (real Neo4j): 69 passed, including a RED→GREEN proof of the silent drop (`tests/neo4j/test_silent_edge_drop.py`) and a dirty-graph migration test (`tests/neo4j/test_node_identity_migration.py`).
- `ruff` clean on all changed files.

Reviewer checklist
- `ensure_neo4j_schema` order: dedup → backfill → verify → drop legacy index → `:Node` constraint, before any write (`_flush_body` awaits `_ensure_schema` first).
- … `:Node` and SETs `:Session`.
- … `:Node` uniqueness constraint.
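For reviewers checking the step order, the sequence dedup → backfill → verify → drop legacy index → `:Node` constraint can be sketched as a pure function over a toy graph state. This is an illustration under stated assumptions, not `ensure_neo4j_schema` itself: `migrate`, `has_duplicates`, the `state` dictionary shape, and the constraint name are hypothetical, and merging duplicates by unioning their labels is an assumed policy rather than the one the PR implements. Only `idx_node_universal` is taken from the PR text.

```python
from collections import Counter

LEGACY_INDEX = "idx_node_universal"   # legacy index name from the PR text
CONSTRAINT = "node_identity_unique"   # hypothetical constraint name


def has_duplicates(nodes):
    """True if any (node_id, workspace) pair occurs more than once."""
    counts = Counter((n["node_id"], n["workspace"]) for n in nodes)
    return any(c > 1 for c in counts.values())


def migrate(state):
    """Return a migrated copy of a toy state; the input is left untouched."""
    # 1. Global dedup on (node_id, workspace). Assumed policy: union labels.
    merged = {}
    for n in state["nodes"]:
        key = (n["node_id"], n["workspace"])
        if key in merged:
            merged[key]["labels"] |= set(n["labels"])
        else:
            merged[key] = {
                "node_id": n["node_id"],
                "workspace": n["workspace"],
                "labels": set(n["labels"]),
            }
    nodes = list(merged.values())

    # 2. Backfill: tag every node with the universal label.
    for n in nodes:
        n["labels"].add("Node")

    # 3. Verify before touching indexes or constraints.
    if has_duplicates(nodes) or any("Node" not in n["labels"] for n in nodes):
        raise RuntimeError("graph is not ready for the :Node constraint")

    # 4. Drop the legacy plain index, 5. then add the uniqueness constraint.
    indexes = set(state["indexes"]) - {LEGACY_INDEX}
    constraints = set(state["constraints"]) | {CONSTRAINT}
    return {"nodes": nodes, "indexes": indexes, "constraints": constraints}


dirty = {
    "nodes": [
        {"node_id": "s1", "workspace": "ws", "labels": {"Session"}},  # pre-#19
        {"node_id": "s1", "workspace": "ws", "labels": {"Node"}},     # duplicate
        {"node_id": "s2", "workspace": "ws", "labels": {"Session"}},  # untagged
    ],
    "indexes": {LEGACY_INDEX},
    "constraints": set(),
}

once = migrate(dirty)
twice = migrate(once)   # a second run should change nothing
```

The ordering matters in this model for the same reason it does in the PR: a uniqueness constraint cannot be created while duplicates exist, and it only guards nodes that already carry the label, so dedup and backfill have to come first.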